<h1>Transcendent Memory Internals in Xen</h1>
<P>
by Dan Magenheimer, Oracle Corp.</p>
<P>
Draft 0.1 -- Updated: 20100324
<h2>Overview</h2>
<P>
This document focuses on the internal implementation of
Transcendent Memory (tmem) on Xen.  It assumes
that the reader has a basic knowledge of the terminology, objectives, and
functionality of tmem and also has access to the Xen source code.
It corresponds to the Xen 4.0 release, with
patch added to support page deduplication (V2).
<P>
The primary responsibilities of the tmem implementation are to:
<ul>
<li>manage a potentially huge and extremely dynamic
number of memory pages from a potentially large number of clients (domains)
with low memory overhead and proper isolation
<li>provide quick and efficient access to these
pages with as much concurrency as possible
<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. when
memory is fully utilized)
<li>optionally, increase page density through compression and/or
deduplication
<li>where necessary, properly assign and account for
memory belonging to guests to avoid malicious and/or accidental unfairness
and/or denial-of-service
<li>record utilization statistics and make them available to management tools
</ul>
<h2>Source Code Organization</h2>

<P>
The source code in Xen that provides the tmem functionality
is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h.
The files tmem.c and tmem.h are intended to
be implementation- (and hypervisor-) independent and the other two files
provide the Xen-specific code.  This
division is intended to make it easier to port tmem functionality to other
hypervisors, though at this time porting to other hypervisors has not been
attempted.  Together, these four files
total less than 4000 lines of C code.
<P>
Even ignoring the implementation-specific functionality, the
implementation-independent part of tmem has several dependencies on
library functionality (Xen source filenames in parentheses):
<ul>
<li>
a good fast general-purpose dynamic memory
allocator with bounded response time and efficient use of memory for a very
large number of sub-page allocations.  To
achieve this in Xen, the bad old memory allocator was replaced with a
slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by
Nitin Gupta for compcache.
<li>
good tree data structure libraries, specifically
<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c).
Code for these was borrowed for Linux and adapted for tmem and Xen.
<li>
good locking and list code.  Both of these existed in Xen and required
little or no change.
<li>
optionally, a good fast lossless compression
library.  The Xen implementation added to
support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta.
</ul>
<P>
More information about the specific functionality of these
libraries can easily be found through a search engine, via wikipedia, or in the
Xen or Linux source logs so we will not elaborate further here.

<h2>Prefixes/Abbreviations/Glossary</h2>

<P>
The tmem code uses several prefixes and abbreviations.
Knowledge of these will improve code readability:
<ul>
<li>
<i>tmh</i> ==
transcendent memory host.  Functions or
data structures that are defined by the implementation-specific code, i.e. the
Xen host code
<li>
<i>tmemc</i>
== transcendent memory control.
Functions or data structures that provide management tool functionality,
rather than core tmem operations.
<li>
<i>cli </i>or
<i>client</i> == client.
The tmem generic term for a domain or a guest OS.
</ul>
<P>
When used in prose, common tmem operations are indicated
with a different font, such as <big><kbd>put</kbd></big>
and <big><kbd>get</kbd></big>.

<h2>Key Data Structures</h2>

<P>
To manage a huge number of pages, efficient data structures
must be carefully selected.
<P>
Recall that a tmem-enabled guest OS may create one or more
pools with different attributes.  It then
<kbd>put</kbd></big>s and <kbd>get</kbd></big>s
pages to/from this pool, identifying the page
with a <i>handle</i> that consists of a <i>pool_id</i>, an <i>
object_id</i>, and a <i>page_id </i>(sometimes
called an <i>index</i>).
This suggests a few obvious core data
structures:
<ul>
<li>
When a guest OS first calls tmem, a <i>client_t</i> is created to contain
and track all uses of tmem by that guest OS.  Among
other things, a <i>client_t</i> keeps pointers
to a fixed number of pools (16 in the current Xen implementation).
<li>
When a guest OS requests a new pool, a <i>pool_t</i> is created.
Some pools are shared and are kept in a
sharelist (<i>sharelist_t</i>) which points
to all the clients that are sharing the pool.
Since an <i>object_id</i> is 64-bits,
a <i>pool_t</i> must be able to keep track
of a potentially very large number of objects.
To do so, it maintains a number of parallel trees (256 in the current
Xen implementation) and a hash algorithm is applied to the <i>object_id</i>
to select the correct tree.
Each tree element points to an object.
Because an <i>object_id</i> usually represents an <i>inode</i>
(a unique file number identifier), and <i>inode</i> numbers
are fairly random, though often &quot;clumpy&quot;, a <i>red-black tree</i>
is used.
<li>
When a guest first
<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> an
<i>obj_t</i> is created.  Since a <i
>page_id</i> is usually an index into a file,
it is often a small number, but may sometimes be very large (up to
32-bits).  A <i>radix tree</i> is a good data structure to contain items
with this kind of index distribution.
<li>
When a page is
<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which
among other things will point to the storage location where the data is kept.
In the normal case the pointer is to a <i>pfp_t</i>, which is an
implementation-specific datatype representing a physical pageframe in memory
(which in Xen is a &quot;struct page_info&quot;).
When deduplication is enabled, it points to
yet another data structure, a <i>pcd_</i>t
(see below).  When compression is enabled
(and deduplication is not), the pointer points directly to the compressed data.
For reasons we will see shortly, each <i>pgp_t</i> that represents
an <i>ephemeral</i> page (that is, a page placed
in an <i>ephemeral</i> pool) is also placed
into two doubly-linked linked lists, one containing all ephemeral pages
<kbd>put</kbd></big> by the same client and one
containing all ephemeral pages across all clients (&quot;global&quot;).
<li>
When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to
the same data, so another data structure (and level of indirection) is used
called a page content descriptor, or <i>pcd_t</i>.
Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>.
The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i>
(if a full page of data), directly to a
location in memory (if the page has been compressed or trailing zeroes have
been eliminated), or even a NULL pointer (if the page contained all zeroes and
trailing zero elimination is enabled).
</ul>
<P>
The most apparent usage of this multi-layer web of data structures
is &quot;top-down&quot; because, in normal operation, the vast majority of tmem
operations invoked by a client are
<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various
data structures to be walked starting with the <i>client_t</i>, then
a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>.
However, there is another highly frequent tmem operation that is not
visible from a client: memory reclamation.
Since tmem attempts to use all spare memory in the system, it must
frequently free up, or <i>evict</i>,
pages.  The eviction algorithm will be
explained in more detail later but, in brief, to free memory, ephemeral pages
are removed from the tail of one of the doubly-linked lists, which means that
all of the data structures associated with that page-to-be-removed must be
updated or eliminated and freed.  As a
result, each data structure also contains a <i>back-pointer</i>
to its parent, for example every <i>obj_t</i>
contains a pointer to its containing <i>pool_t</i>.
<P>
This complex web of interconnected data structures is updated constantly and
thus extremely sensitive to careless code changes which, for example, may
result in unexpected hypervisor crashes or non-obvious memory leaks.
On the other hand, the code is fairly well
modularized so, once understood, it is possible to relatively easily switch out
one kind of data structure for another.
To catch problems as quickly as possible when debug is enabled, most of
the data structures are equipped with <i>sentinels</i>and many inter-function
assumptions are documented and tested dynamically
with <i>assertions</i>.
While these clutter and lengthen the tmem
code substantially, their presence has proven invaluable on many occasions.
<P>
For completeness, we should also describe a key data structure in the Xen
implementation-dependent code: the <i>tmh_page_list</i>. For security and
performance reasons, pages that are freed due to tmem operations (such
as <kbd>get</kbd></big>) are not immediately put back into Xen's pool
of free memory (aka the Xen <i>heap</i>).
Tmem pages may contain guest-private data that must be <i>scrubbed</i> before
those memory pages are released for the use of other guests.
But if a page is immediately re-used inside of tmem itself, the entire
page is overwritten with new data, so need not be scrubbed.
Since tmem is usually the most frequent
customer of the Xen heap allocation code, it would be a waste of time to scrub
a page, release it to the Xen heap, and then immediately re-allocate it
again.  So, instead, tmem maintains
currently-unused pages of memory on its own free list, <i>tmh_page_list</i>,
and returns the pages to Xen only when non-tmem Xen
heap allocation requests would otherwise fail.

<h2>Scalablility/Concurrency</h2>

<P>Tmem has been designed to be highly scalable.
Since tmem access is invoked similarly in
many ways to asynchronous disk access, a &quot;big SMP&quot; tmem-aware guest
OS can, and often will, invoke tmem hypercalls simultaneously on many different
physical CPUs.  And, of course, multiple
tmem-aware guests may independently and simultaneously invoke tmem
hypercalls.  While the normal frequency
of tmem invocations is rarely extremely high, some tmem operations such as data
compression or lookups in a very large tree may take tens of thousands of
cycles or more to complete.  Measurements
have shown that normal workloads spend no more than about 0.2% (2% with
compression enabled) of CPU time executing tmem operations.
But those familiar with OS scalability issues
recognize that even this limited execution time can create concurrency problems
in large systems and result in poorly-scalable performance.
<P>
A good locking strategy is critical to concurrency, but also
must be designed carefully to avoid deadlock and <i>livelock</i> problems.  For
debugging purposes, tmem supports a &quot;big kernel lock&quot; which disables
concurrency altogether (enabled in Xen with &quot;tmem_lock&quot;, but note
that this functionality is rarely tested and likely has bit-rotted). Infrequent
but invasive tmem hypercalls, such as pool creation or the control operations,
are serialized on a single <i>read-write lock</i>, called tmem_rwlock,
which must be held for writing.  All other tmem operations must hold this lock
for reading, so frequent operations such as
<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can execute simultaneously
as long as no invasive operations are occurring.
<P>
Once a pool has been selected, there is a per-pool
read-write lock (<i>pool_rwlock</i>) which
must be held for writing if any transformative operations might occur within
that pool, such as when an<i> obj_t</i> is
created or destroyed.  For the highly
frequent operation of finding an<i> obj_t</i>
within a pool, pool_rwlock must be held for reading.
<P>
Once an object has been selected, there is a per-object
spinlock (<i>obj_spinlock)</i>.
This is a spinlock rather than a read-write
lock because nearly all of the most frequent tmem operations (e.g.
<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big>)
are transformative, in
that they add or remove a page within the object.
This lock is generally taken whenever an
object lookup occurs and released when the tmem operation is complete.
<P>
Next, the per-client and global ephemeral lists are
protected by a single global spinlock (<i>eph_lists_</i>spinlock)
and the per-client persistent lists are also protected by a single global
spinlock (<i>pers_list_spinlock</i>).
And to complete the description of
implementation-independent locks, if page deduplication is enabled, all pages
for which the first byte match are contained in one of 256 trees that are
protected by one of 256 corresponding read-write locks
(<i>pcd_tree_rwlocks</i>).
<P>
In the Xen-specific code (tmem_xen.c), page frames (e.g.  struct page_info)
that have been released are kept in a list (<i>tmh_page_list</i>) that
is protected by a spinlock (<i>tmh_page_list_lock</i>).
There is also an &quot;implied&quot; lock
associated with compression, which is likely the most time-consuming operation
in all of tmem (of course, only when compression is enabled): A compression
buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to
this buffer is returned to implementation-independent code and used without a
lock.
<P>
The proper method to avoid deadlocks is to take and release
locks in a very specific predetermined order.
Unfortunately, since tmem data structures must simultaneously be
accessed &quot;top-down&quot; (
<kbd>put</kbd></big> and <kbd>get</kbd></big>)
and &quot;bottoms-up&quot;
(memory reclamation), more complex methods must be employed:
A <i>trylock</i>mechanism is used (c.f. <i>tmem_try_to_evict_pgp()</i>),
which takes the lock if it is available but returns immediately (rather than
spinning and waiting) if the lock is not available.
When walking the ephemeral list to identify
pages to free, any page that belongs to an object that is locked is simply
skipped.  Further, if the page is the
last page belonging to an object, and the pool read-write lock for the pool the
object belongs to is not available (for writing), that object is skipped.
These constraints modify the LRU algorithm
somewhat, but avoid the potential for deadlock.
<P>
Unfortunately, a livelock was still discovered in this approach:
When memory is scarce and each client is
<kbd>put</kbd></big>ting a large number of pages
for exactly one object (and thus holding the object spinlock for that object),
memory reclamation takes a very long time to determine that it is unable to
free any pages, and so the time to do a
<kbd>put</kbd></big> (which eventually fails) becomes linear to the
number of pages in the object!  To avoid
this situation, a workaround was added to always ensure a minimum amount of
memory (1MB) is available before any object lock is taken for the client
invoking tmem (see <i>tmem_ensure_avail_pages()</i>).
Other such livelocks (and perhaps deadlocks)
may be lurking.
<P>
A last issue related to concurrency is atomicity of counters.
Tmem gathers a large number of
statistics.  Some of these counters are
informational only, while some are critical to tmem operation and must be
incremented and decremented atomically to ensure, for example, that the number
of pages in a tree never goes negative if two concurrent tmem operations access
the counter exactly simultaneously.  Some
of the atomic counters are used for debugging (in assertions) and perhaps need
not be atomic; fixing these may increase performance slightly by reducing
cache-coherency traffic.  Similarly, some
of the non-atomic counters may yield strange results to management tools, such
as showing the total number of successful
<kbd>put</kbd></big>s as being higher than the number of
<kbd>put</kbd></big>s attempted.
These are left as exercises for future tmem implementors.

<h2>Control and Manageability</h2>

<P>
Tmem has a control interface to, for example, set various
parameters and obtain statistics.  All
tmem control operations funnel through <i>do_tmem_control()</i>
and other functions supporting tmem control operations are prefixed
with <i>tmemc_</i>.

<P>
During normal operation, even if only one tmem-aware guest
is running, tmem may absorb nearly all free memory in the system for its own
use.  Then if a management tool wishes to
create a new guest (or migrate a guest from another system to this one), it may
notice that there is insufficient &quot;free&quot; memory and fail the creation
(or migration).  For this reason, tmem
introduces a new tool-visible class of memory -- <i>freeable</i> memory --
and provides a control interface to access
it.  All ephemeral memory and all pages on the <i>tmh_page_list</i>
are freeable. To properly access freeable
memory, a management tool must follow a sequence of steps:
<ul>
<li>
<i>freeze</i>
tmem:When tmem is frozen, all 
<kbd>put</kbd></big>s fail, which ensures that no
additional memory may be absorbed by tmem.
(See <i>tmemc_freeze_pools()</i>, and
note that individual clients may be frozen, though this functionality may be
used only rarely.)
<li>
<i>query freeable MB: </i>If all freeable memory were released to the Xen
heap, this is the amount of memory (in MB) that would be freed.
See <i>tmh_freeable_pages()</i>.
<li>
<i>flush</i>:
Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g.
back to the Xen heap.  This amount is
specified in KB.  See <i
>tmemc_flush_mem()</i> and <i
>tmem_relinquish_npages()</i>.
<li>
At this point the management tool may allocate
the memory, e.g. using Xen's published interfaces.
<li>
<i>thaw</i>
tmem: This terminates the freeze, allowing tmem to accept 
<kbd>put</kbd></big>s again.
</ul>
<P>
Extensive tmem statistics are available through tmem's
control interface (see <i>tmemc_list </i>and
the separate source for the &quot;xm tmem-list&quot; command and the
xen-tmem-list-parse tool).  To maximize
forward/backward compatibility with future tmem and tools versions, statistical
information is passed via an ASCII interface where each individual counter is
identified by an easily parseable two-letter ASCII sequence.

<h2>Save/Restore/Migrate</h2>

<P>
Another piece of functionality that has a major impact on
the tmem code is support for save/restore of a tmem client and, highly related,
live migration of a tmem client.
Ephemeral pages, by definition, do not need to be saved or
live-migrated, but persistent pages are part of the state of a running VM and
so must be properly preserved.
<P>
When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step
is for the tmem client to be frozen (see the manageability section).
Next, tmem API version information is
recorded (to avoid possible incompatibility issues as the tmem spec evolves in
the future).  Then, certain high-level
tmem structural information specific to the client is recorded, including
information about the existing pools.
Finally, the contents of all persistent pages are recorded.
<P>
For live-migration, the process is somewhat more complicated.
Ignoring tmem for a moment, recall that in
live migration, the vast majority of the VM's memory is transferred while the
VM is still fully operational.  During
each phase, memory pages belonging to the VM that are changed are marked and
then retransmitted during a later phase.
Eventually only a small amount of memory remains, the VM is paused, the
remaining memory is transmitted, and the VM is unpaused on the target machine.
<P>
The number of persistent tmem pages may be quite large,
possibly even larger than all the other memory used by the VM; so it is
unacceptable to transmit persistent tmem pages during the &quot;paused&quot;
phase of live migration.  But if the VM
is still operational, it may be making calls to tmem:
A frozen tmem client will reject any 
<big><kbd>put</kbd></big> operations, but tmem must
still correctly process <big><kbd>flush</kbd></big>es
(page and object), including implicit flushes due to duplicate 
<big><kbd>put</kbd></big>s.
Fortunately, these operations can only
invalidate tmem pages, not overwrite tmem pages or create new pages.
So, when a live-migrate has been initiated,
the client is frozen.  Then during the
&quot;live&quot; phase, tmem transmits all persistent pages, but also records
the handle of all persistent pages that are invalidated.
Then, during the &quot;paused&quot; phase,
only the handles of invalidated persistent pages are transmitted, resulting in
the invalidation on the target machine of any matching pages that were
previously transmitted during the &quot;live&quot; phase.
<P>
For restore (and on the target machine of a live migration),
tmem must be capable of reconstructing the internal state of the client from
the saved/migrated data.  However, it is
not the client itself that is <big><kbd>put</kbd></big>'ing
the pages but the management tools conducting the restore/migration.
This slightly complicates tmem by requiring
new API calls and new functions in the implementation, but the code is
structured so that duplication is minimized.
Once all tmem data structures for the client are reconstructed, all
persistent pages are recreated and, in the case of live-migration, all
invalidations have been processed and the client has been thawed, the restored
client can be resumed.
<P>
Finally, tmem's data structures must be cluttered a bit to
support save/restore/migration.  Notably,
a per-pool list of persistent pages must be maintained and, during live
migration, a per-client list of invalidated pages must be logged.
A reader of the code will note that these
lists are overlaid into space-sensitive data structures as a union, which may
be more error-prone but eliminates significant space waste.

<h2>Miscellaneous Tmem Topics</h2>

<P>
<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>.
One interesting corner case that
significantly complicates the tmem source code is the possibility
of a <i>duplicate</i>
<big><kbd>put</kbd></big>,
which occurs when two
<big><kbd>put</kbd></big>s
are requested with the same handle but with possibly different data.
The tmem API addresses
<i>
<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big>
coherence</i> explicitly: When a duplicate
<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The 
<big><kbd>put</kbd></big> may succeed with the old
data overwritten by the new data, or (2) the
<big><kbd>put</kbd></big> may be failed with the original data flushed and
neither the old nor the new data accessible.
Tmem may <i>not</i> fail the 
<big><kbd>put</kbd></big> and leave the old data accessible.
<P>
When tmem has been actively working for an extended period,
system memory may be in short supply and it is possible for a memory allocation
for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus,
for a duplicate 
<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily
simultaneously maintain data structures and data for both the original 
<big><kbd>put</kbd></big> and the duplicate 
<big><kbd>put</kbd></big>.
When the space required for the data is
identical, tmem may be able to overwrite <i>in place </i>the old data with
the new data (option 1).  But in some circumstances, such as when data
is being compressed, overwriting is not always possible and option 2 must be
performed.
<P>
<i><b>Page deduplication and trailing-zero elimination.</b></i>
When page deduplication is enabled
(&quot;tmem_dedup&quot; option to Xen), ephemeral pages for which the contents
are identical -- whether the pages belong
to the same client or different clients -- utilize the same pageframe of
memory.  In Xen environments where
multiple domains have a highly similar workload, this can save a substantial
amount of memory, allowing a much larger number of ephemeral pages to be
used.  Tmem page deduplication uses
methods similar to the KSM implementation in Linux [ref], but differences between
the two are sufficiently great that tmem does not directly leverage the
code.  In particular, ephemeral pages in
tmem are never dirtied, so need never be <i>copied-on-write</i>.
Like KSM, however, tmem avoids hashing,
instead employing <i>red-black trees</i>
that use the entire page contents as the <i>lookup
key</i>.  There may be better ways to implement this.
<P>
Dedup'ed pages may optionally be compressed
(&quot;tmem_compress&quot; and &quot;tmem_dedup&quot; Xen options specified),
to save even more space, at the cost of more time.
Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed
pages.  With tze, pages that contain a
significant number of zeroes at the end of the page are saved without the trailing
zeroes; an all-zero page requires no data to be saved at all.
In certain workloads that utilize a large number
of small files (and for which the last partial page of a file is padded with
zeroes), a significant space savings can be realized without the high cost of
compression/decompression.
<P>
Both compression and tze significantly complicate memory
allocation.  This will be discussed more below.
<P>
<b><i>Memory accounting</i>.</b>
Accounting is boring, but poor accounting may
result in some interesting problems.  In
the implementation-independent code of tmem, most data structures, page frames,
and partial pages (e.g. for compresssion) are <i>billed</i> to a pool,
and thus to a client.  Some <i>infrastructure</i> data structures, such as
pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not
require a pool to be specified.  Two other
exceptions are page content descriptors (<i>pcd_t</i>)
and sharelists (<i>sharelist_t</i>) which
are explicitly not associated with a pool/client by specifying NULL instead of
a <i>pool_t</i>.
(Note to self:
These should probably just use the <i>tmh_alloc_infra()</i> interface too.)
As we shall see, persistent pool pages and
data structures may need to be handled a bit differently, so the
implementation-independent layer calls a different allocation/free routine for
persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>)
than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>).
<P>
In the Xen-specific layer, we
disregard the <i>pool_t</i> for ephemeral
pages, as we use the generic Xen heap for all ephemeral pages and data
structures.(Denial-of-service attacks
can be handled in the implementation-independent layer because ephemeral pages
are kept in per-client queues each with a counted length.
See the discussion on weights and caps below.)
However we explicitly bill persistent pages
and data structures against the client/domain that is using them.
(See the calls to the Xen routine <i>alloc_domheap_pages() </i>in tmem_xen.h; of
the first argument is a domain, the pages allocated are billed by Xen to that
domain.)This means that a Xen domain
cannot allocate even a single tmem persistent page when it is currently utilizing
its maximum assigned memory allocation!
This is reasonable for persistent pages because, even though the data is
not directly accessible by the domain, the data is permanently saved until
either the domain flushes it or the domain dies.
<P>
Note that proper accounting requires (even for ephemeral pools) that the same
pool is referenced when memory is freed as when it was allocated, even if the
ownership of a pool has been moved from one client to another (c.f. <i
>shared_pool_reassign()</i>).
The underlying Xen-specific information may
not always enforce this for ephemeral pools, but incorrect alloc/free matching
can cause some difficult-to-find memory leaks and bent pointers.
<P>
Page deduplication is not possible for persistent pools for
accounting reasons: Imagine a page that is created by persistent pool A, which
belongs to a domain that is currently well under its maximum allocation.
Then the <i>pcd_t</i>is matched by persistent pool B, which is
currently at its maximum.
Then the domain owning pool A is destroyed.
Is B beyond its maximum?
(There may be a clever way around this
problem.  Exercise for the reader!)
<P>
<b><i>Memory allocation.</i></b> The implementation-independent layer assumes
there is a good fast general-purpose dynamic memory allocator with bounded
response time and efficient use of memory for a very large number of sub-page
allocations.  The old xmalloc memory
allocator in Xen was not a good match for this purpose, so was replaced by the
TLSF allocator.  Note that the TLSF
allocator is used only for allocations smaller than a page (and, more
precisely, no larger than <i>tmem_subpage_maxsize()</i>);
full pages are allocated by Xen's normal heap allocator.
<P>
After the TLSF allocator was integrated into Xen, more work
was required so that each client could allocate memory from a separate
independent pool. (See the call to <i>xmem_pool_create()</i>in
<i>tmh_client_init()</i>.) 
This allows the data structures allocated for the
purpose of supporting persistent pages to be billed to the same client as the
pages themselves.  It also allows partial
(e.g. compressed) pages to be properly billed.
Further, when partial page allocations cause internal fragmentation,
this fragmentation can be isolated per-client.
And, when a domain dies, full pages can be freed, rather than only
partial pages. One other change was
required in the TLSF allocator: In the original version, when a TLSF memory
pool was allocated, the first page of memory was also allocated.
Since, for a persistent pool, this page would
be billed to the client, the allocation of the first page failed if the domain
was started at its maximum memory, and this resulted in a failure to create the
memory pool.  To avoid this, the code was
changed to delay the allocation of the first page until first use of the memory
pool.
<P>
<b><i>Memory allocation interdependency.</i></b>
As previously described,
pages of memory must be moveable back and forth between the Xen heap and the
tmem ephemeral lists (and page lists).
When tmem needs a page but doesn't have one, it requests one from the
Xen heap (either indirectly via xmalloc, or directly via Xen's <i
>alloc_domheap_pages()</i>).
And when Xen needs a page but doesn't have
one, it requests one from tmem (via a call to <i
>tmem_relinquish_pages()</i> in Xen's <i
>alloc_heap_pages() </i>in page_alloc.c).
This leads to a potential infinite loop!
To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen
to flag and disallow the loop.
See <i>tmh_called_from_tmem()</i>
in <i>tmem_relinquish_pages()</i>.
Note that the <i
>tmem_relinquish_pages()</i> interface allows for memory requests of
order &gt; 0 (multiple contiguous pages), but the tmem implementation disallows
any requests larger than a single page.
<P>
<b><i>LRU page reclamation</i></b>.
Ephemeral pages generally <i>age </i>in
a queue, and the space associated with the oldest -- or <i
>least-recently-used -- </i>page is reclaimed when tmem needs more
memory.  But there are a few exceptions
to strict LRU queuing.  First is when
removal from a queue is constrained by locks, as previously described above.
Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral
pool, a
<big><kbd>get</kbd></big>
does not imply a
<big><kbd>flush</kbd></big>
Instead, in a shared pool, a 
results in the page being promoted to the front of the queue.
Third, when a page that is deduplicated (i.e.
is referenced by more than one <i>pgp_</i>t)
reaches the end of the LRU queue, it is marked as <i
>eviction attempted</i> and promoted to the front of the queue; if it
reaches the end of the queue a second time, eviction occurs.
Note that only the <i
>pgp_</i>t is evicted; the actual data is only reclaimed if there is no
other <i>pgp_t </i>pointing to the data.
<P>
All of these modified- LRU algorithms deserve to be studied
carefully against a broad range of workloads.
<P>
<b><i>Internal fragmentation</i>.</b>
When
compression or tze is enabled, allocations between a half-page and a full-page
in size are very common and this places a great deal of pressure on even the
best memory allocator.  Additionally,
problems may be caused for memory reclamation: When one tmem ephemeral page is
evicted, only a fragment of a physical page of memory might be reclaimed.
As a result, when compression or tze is
enabled, it may take a very large number of eviction attempts to free up a full
contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction
must be assumed to be able to fail.
While all memory allocation paths in tmem are resilient to failure, very
complex corner cases may eventually occur.
As a result, compression and tze are disabled by default and should be
used with caution until they have been tested with a much broader set of
workloads.(Note to self: The 
code needs work.)
<P>
<b><i>Weights and caps</i>.</b>
Because
of the just-discussed LRU-based eviction algorithms, a client that uses tmem at
a very high frequency can quickly swamp tmem so that it provides little benefit
to a client that uses it less frequently.
To reduce the possibility of this denial-of-service, limits can be
specified via management tools that are enforced internally by tmem.
On Xen, the &quot;xm tmem-set&quot; command
can specify &quot;weight=&lt;weight&gt;&quot; or &quot;cap=&lt;cap&gt;&quot;
for any client.  If weight is non-zero
for a client and the current percentage of ephemeral pages in use by the client
exceeds its share (as measured by the sum of weights of all clients), the next
page chosen for eviction is selected from the requesting client's ephemeral
queue, instead of the global ephemeral queue that contains pages from all
clients.(See <i>client_over_quota().</i>)
Setting a cap for a client is currently a no-op.
<P>
<b><i>Shared pools and authentication.</i></b>
When tmem was first proposed to the linux kernel mailing list
(LKML), there was concern expressed about security of shared ephemeral
pools.  The initial tmem implementation only
required a client to provide a 128-bit UUID to identify a shared pool, and the
linux-side tmem implementation obtained this UUID from the superblock of the
shared filesystem (in ocfs2).  It was
pointed out on LKML that the UUID was essentially a security key and any
malicious domain that guessed it would have access to any data from the shared
filesystem that found its way into tmem.
Ocfs2 has only very limited security; it is assumed that anyone who can
access the filesystem bits on the shared disk can mount the filesystem and use
it.  But in a virtualized data center,
higher isolation requirements may apply.
As a result, a Xen boot option -- &quot;tmem_shared_auth&quot; -- was
added.  The option defaults to disabled,
but when it is enabled, management tools must explicitly authenticate (or may
explicitly deny) shared pool access to any client.
On Xen, this is done with the &quot;xm
tmem-shared-auth&quot; command.
<P>
<b><i>32-bit implementation</i>.</b>
There was some effort put into getting tmem working on a 32-bit Xen.
However, the Xen heap is limited in size on
32-bit Xen so tmem did not work very well.
There are still 32-bit ifdefs in some places in the code, but things may
have bit-rotted so using tmem on a 32-bit Xen is not recommended.

<h2>Known Issues</h2>

<p><b><i>Fragmentation.</i></b>When tmem
is active, all physically memory becomes <i>fragmented</i>
into individual pages.  However, the Xen
memory allocator allows memory to be requested in multi-page contiguous
quantities, called order&gt;0 allocations.
(e.g. 2<sup>order</sup> so
order==4 is sixteen contiguous pages.)
In some cases, a request for a larger order will fail gracefully if no
matching contiguous allocation is available from Xen.
As of Xen 4.0, however, there are several
critical order&gt;0 allocation requests that do not fail gracefully.
Notably, when a domain is created, and
order==4 structure is required or the domain creation will fail.
And shadow paging requires many order==2
allocations; if these fail, a PV live-migration may fail.
There are likely other such issues.
<P>
But, fragmentation can occur even without tmem if any domU does
any extensive ballooning; tmem just accelerates the fragmentation.
So the fragmentation problem must be solved
anyway.  The best solution is to disallow
order&gt;0 allocations altogether in Xen -- or at least ensure that any attempt
to allocate order&gt;0 can fail gracefully, e.g. by falling back to a sequence
of single page allocations. However this restriction may require a major rewrite
in some of Xen's most sensitive code.
(Note that order&gt;0 allocations during Xen boot and early in domain0
launch are safe and, if dom0 does not enable tmem, any order&gt;0 allocation by
dom0 is safe, until the first domU is created.)
<P>
Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack
was added in the Xen page
allocator.(See the comment &quot;
memory is scarce&quot; in <i>alloc_heap_pages()</i>.)
Briefly, a portion of memory is pre-reserved
for allocations where order&gt;0 and order&lt;9.
(Domain creation uses 2MB pages, but fails
gracefully, and there are no other known order==9 allocations or order&gt;9
allocations currently in Xen.)
<P>
<b><i>NUMA</i></b>.  Tmem assumes that
all memory pages are equal and any RAM page can store a page of data for any
client.  This has potential performance
consequences in any NUMA machine where access to <i
>far memory</i> is significantly slower than access to <i
>near memory</i>.
On nearly all of today's servers, however,
access times to <i>far memory</i> is still
much faster than access to disk or network-based storage, and tmem's primary performance
advantage comes from the fact that paging and swapping are reduced.
So, the current tmem implementation ignores
NUMA-ness; future tmem design for NUMA machines is an exercise left for the
reader.

<h2>Bibliography</h2>

<P>
(needs work)<b style='mso-bidi-font-weight:>
<P><a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a>
